On termination detection in crash-prone distributed systems with failure detectors

Authors

  • Neeraj Mittal
  • Felix C. Freiling
  • Subbarayan Venkatesan
  • Lucia Draque Penso
Abstract

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm A that has been designed for a failure-free environment into a termination detection algorithm B that can tolerate process crashes. Our transformation assumes the existence of a perfect failure detector. We show that a perfect failure detector is in fact necessary to solve the termination detection problem in a crash-prone distributed system even if at most one process can crash. Let μ(n, M) and δ(n, M) denote the message complexity and detection latency, respectively, of A when the system has n processes and the underlying computation exchanges M application messages. The message complexity of B is O(n + μ(n, 0)) messages per failure more than the message complexity of A. Also, its detection latency is O(δ(n, 0)) per failure more than that of A. Furthermore, application message size increases by at most log(f + 1) bits, where f is the actual number of processes that fail during an execution. We show that, when the communication topology is fully connected, under a certain realistic assumption, any fault-tolerant termination detection algorithm can be forced to exchange Ω(nf) control messages in the worst case, even when at most one process may be active initially and the underlying computation does not exchange any application messages. This implies that our transformation is optimal in terms of message complexity when μ(n, 0) = O(n). The fault-tolerant termination detection algorithm resulting from the transformation satisfies three desirable properties. First, it can tolerate the failure of up to n − 1 processes. Second, it does not impose any overhead on the fault-sensitive termination detection algorithm until one or more processes crash. Third, it does not block the application at any time.
Further, using our transformation, we derive a fault-tolerant termination detection algorithm that is, to our knowledge, the most efficient fault-tolerant termination detection algorithm proposed so far. Our transformation can be extended to arbitrary communication topologies provided process crashes do not partition the system. © 2008 Elsevier Inc. All rights reserved.
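
Read literally, the bounds stated in the abstract can be collected as in the sketch below. The names μ_B and δ_B for the message complexity and detection latency of B are notation introduced here for illustration and do not appear in the abstract; the constants hidden by the O and Ω terms are not given there either.

```latex
\begin{align*}
\mu_B(n, M, f)    &= \mu(n, M) + O\bigl(f \cdot (n + \mu(n, 0))\bigr)
  && \text{(message complexity of } B\text{)} \\
\delta_B(n, M, f) &= \delta(n, M) + O\bigl(f \cdot \delta(n, 0)\bigr)
  && \text{(detection latency of } B\text{)} \\
\text{extra bits per application message} &\le \log(f + 1) \\
\text{control messages, worst case, any fault-tolerant algorithm} &= \Omega(n f)
\end{align*}
% If \mu(n, 0) = O(n), the per-failure overhead O(n + \mu(n, 0)) is O(n),
% so the total overhead O(n f) matches the \Omega(n f) lower bound.
```

The last comment restates why the abstract calls the transformation optimal in message complexity when μ(n, 0) = O(n).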

Similar resources

Termination Detection in an Asynchronous Distributed System with Crash-Recovery Failures

We revisit the problem of detecting the termination of a distributed application in an asynchronous message-passing model with crash-recovery failures and failure detectors. We derive a suitable definition of termination detection in this model but show that this definition is impossible to implement unless you have a failure detector which can predict the future. We subsequently weaken the pro...

Termination Detection in Systems Where Processes May Crash and Recover

An algorithm solving the termination detection problem observes a computation of a distributed system and announces “termination” if the computation has come to an end. This work addresses termination detection in systems where processes fail by crashing and may restart later on. The new definition of robust-restricted termination sensible in the crash-recovery model is developed. A computation...

Efficient Reduction for Wait-Free Termination Detection in a Crash-Prone Distributed System

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm A that has been designed for a failure-free environment into a termination detection algorithm B that can tolerate process crashes. Ou...

Efficient Reductions for Wait-Free Termination Detection in Faulty Distributed Systems

We investigate the problem of detecting termination of a distributed computation in asynchronous systems where processes can fail by crashing. More specifically, for both fully and arbitrarily connected communication topologies, we describe efficient ways to transform any fault-sensitive termination detection algorithm A, that has been designed for a failure-free environment, into a wait-free ...

Encapsulating Failure Detection: From Crash to Byzantine Failures

Separating different aspects of a program, and encapsulating them inside well defined modules, is considered a good engineering discipline. This discipline is particularly desirable in the development of distributed agreement algorithms which are known to be difficult and error prone. For such algorithms, one aspect that is important to encapsulate is failure detection. In fact, a complete enca...

Journal:
  • J. Parallel Distrib. Comput.

Volume 68

Publication date: 2008